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^ Abstract 

We study randomized variants of two classical algorithms: coordi- 
nate descent for systems of linear equations and iterated projections 
for systems of linear inequalities. Expanding on a recent randomized 
iterated projection algorithm of Strohmer and Vershynin for systems 
^ of linear equations, we show that, under appropriate probability dis- 

tributions, the linear rates of convergence (in expectation) can be 
bounded in terms of natural linear-algebraic condition numbers for 
the problems. We relate these condition measures to distances to ill- 
posedness, and discuss generalizations to convex systems under metric 
^ regularity assumptions. 

1 Introduction 



The condition number of a problem measures the sensitivity of a solution 
to small perturbations in its input data. For many problems that arise in 
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numerical analysis, there is often a simple relationship between the condi- 
tion number of a problem instance and the distance to the set of ill-posed 
problems — those problem instances whose condition numbers are infinite [3] . 
For example, with respect to the problem of inverting a matrix A, it is known 
(see [11J, for example) that if A is perturbed to A + E for sufficiently small 
E, then 

" ( - 4+ fr! l r j4 ""<M- i iiii^ii+Q(ii^ 

|| ^4— 1 1| ~~ 

Thus, a condition measure for this problem may be taken as ||A -1 ||. Asso- 
ciated with this is the classical Eckart-Young theorem found in [7|, relating 
the above condition measure to the distance to ill-posedness. 

Theorem 1.1 (Eckart-Young) For any non- singular matrix, A, 
min{\\E\\ : A + E is singular} = . 

We are typically concerned with relative condition numbers as introduced 
by Demmel in [3j. For example, with respect to the problem of matrix 
inversion, the relative condition number is k(A) := \\A\\ ||y4 _1 ||, the commonly 
used condition measure. 

Condition numbers are also important from an algorithmic perspective. 
In the above example of matrix inversion, for instance, the sensitivity of a 
problem under perturbations could come into prominence regarding errors 
in either the initial problem data or accumulated computational error due 
to rounding. Hence, it seems natural that condition numbers should affect 
algorithm speed. For example, in the context of linear programming, Renegar 
defined a condition measure based on the distance to ill-posedness in [19]- 
in a similar sense as the Eckart-Young result-and showed its effect on the 
convergence rate of interior point methods in |20j. 

For another example, consider the problem of finding a solution to the 
system Ax = b where A is a positive-definite matrix. It was shown in [Tj 
that the steepest descent method is linearly convergent with rate ( ^j~| ) 2 
and that this bound is asymptotically tight for almost all choices of initial 
iterates. Similarly, it is well known (see [8]) that the conjugate gradient 
method applied to the same problem is also linearly convergent with rate 
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From a computational perspective, a related and important area of study 
is that of error bounds. Given a subset of a Euclidean space, an error bound 
is an inequality that bounds the distance from a test vector to the specified 
subset in terms of some residual function that is typically easy to compute. 
In that sense, an error bound can be used both as part of a stopping rule 
during implementation of an algorithm as well as an aide in proving algo- 
rithmic convergence. A comprehensive survey of error bounds for a variety 
of problems arising in optimization can be found in [15] . 

With regards to the problem of solving a nonsingular linear system Ax = 
b, one connection between condition measures and error bounds is immediate. 
Let x* be a solution to the system and x be any other vector. Then 

||ar — ar*|| = \\A- x A(x-x*)\\ = \\A~\Ax - b)\\ < \\A- x \\\\Ax - 

so the distance to the solution set is bounded by a constant multiple of the 
residual vector, \\Ax — b\\, and this constant is the same one that appears in 
the context of conditioning and distance to infeasibility. As we discuss later, 
this result is not confined to systems of linear equations. 

As a result, error bounds and the related condition numbers often make 
a prominent appearance in convergence proofs for a variety of algorithms. In 
this paper, motivated by a recent randomized iterated projection scheme for 
systems of linear equations due to Strohmer and Vershynin in [22] , we revisit 
some classical algorithms and show that, with an appropriate randomization 
scheme, we can demonstrate convergence rates directly in terms of these nat- 
ural condition measures. The rest of the paper is organized as follows. In 
the remainder of this section, we define some notation used throughout the 
rest of this paper. In Section [3j we consider the problem of solving a lin- 
ear system Ax = b and show that a randomized coordinate descent scheme, 
implemented according to a specific probability distribution, is linearly con- 
vergent with a rate expressible in terms of traditional conditioning measures. 
In Section |4j we build upon the work of Strohmer and Vershynin in [22] by 
considering randomized iterated projection algorithms for linear inequality 
systems. In particular, we show how randomization can provide convergence 
rates in terms of the traditional Hoffman error bound in [10] as well as in 
terms of Renegar's distance to infeasibility from [IB] . In Section [5j we con- 
sider randomized iterated projection algorithms for general convex sets and, 
under appropriate metric regularity assumptions, obtain local convergence 
rates in terms of the modulus of regularity. 
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The classical, deterministic versions of the simple algorithms we consider 
here have been widely studied, in part due to the extreme simplicity of each 
iteration: their linear convergence is well-known. However, as remarked 
for linear systems of equations in [22] . randomized versions are interesting 
for several reasons. The randomized iterated projection method for linear 
equations from which this work originated may have some practical promise, 
even compared with conjugate gradients, for example (22]. Our emphasis 
here, however, is theoretical: randomization here provides a framework for 
simplifying the analysis of algorithms, allowing easy bounds on the rates of 
linear convergence in terms of natural linear-algebraic condition measures, 
such as relative condition numbers, Hoffman constants, and the modulus of 
metric regularity. 

2 Notation 

On the Euclidean space R™, we denote the Euclidean norm by || ■ ||. Let 
denote the column vector with a 1 in the i th position and zeros elsewhere. 

We consider m-bj-n real matrices A. We denote the set of rows of A by 
{aj , . . . , a^} and the set of columns is denoted {Ai, . . . , A n }. The spectral 
norm of A is the quantity ||A||2 := maxiui^i and the Frobenius norm 

is \\A\\f '■= Yl,ij a %- These norms satisfy the following inequality [IT] : 

(2.1) \\A\\ F <^\\A\\ 2 . 

For an arbitrary matrix, A, let ||A _1 ||2 be the smallest constant M such 
that || Ac || 2 > T7 1| x || 2 for all vectors x. In the case m > n, if A has singular 
values 0i > a 2 > ■ ■ ■ > a n , then M can also be expressed as the reciprocal of 
the minimum singular value o~ n , and, if A is invertible, this quantity equals 
the spectral norm of A^ 1 . 

The relative condition number of A is the quantity k(A) := ] | ^4 1 1 2 1 1 ^4 1 1 1 2 ; 
related to this is the scaled condition number introduced by Demmel in [I] , 
defined by k(A) := 1 1 ^4 1 1 ^ 1 1 ^4 1 1 1 2 - From this, it is easy to verify (using the 
singular value decomposition, for example) the following relationship between 
condition numbers: 

1 < -\J- < k(A). 

Now suppose the matrix A is n-by-n symmetric and positive definite. The 
energy norm (or A-norm), denoted || • H^, is defined by \\x\\a '■= Vx T Ax. The 
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inequality 

(2.2) \\x\\ 2 A < \\A- l \\ 2 ■ \\Ax\\ 2 forallxeR" 

is useful later. Further, if A is simply positive semi-definite, we can generalize 



Inequality 2.2 



(2.3) x t Ax < _±_\\ Ax \\2 

where X(A) is the smallest non-zero eigenvalue of A. We denote the trace of 
A by tr A: it satisfies the inequality 

(2.4) \\A\\ F > ^£ 

Given a nonempty closed convex set S, let Ps(x) be the projection of 
x onto S: that is, Ps(x) is the vector y that is the optimal solution to 
mines' || x — z\\ 2 . Additionally, define the distance from x to a set S by 

d(x, S) = min ||a; — z\\ 2 = \\x — Ps(x)\\. 
The following useful inequality is standard: 

(2.5) \\y - x\\ 2 - \\P s (y) - x\\ 2 > \\y - P s (y)\\ 2 for all x e S, y G R n . 



3 Randomized Coordinate Descent 

Let A be an n-by-n positive-definite matrix. We consider a linear system 
of the form Ax = b, with solution x* = We consider the equivalent 

problem of minimizing the strictly convex quadratic function 

f(x) = -x T Ax — b T x, 

and note the standard relationship 

(3-1) f(x)-f(x*) = \\\x-x*\\ 2 A . 

Suppose our current iterate is x and we obtain a new iterate x + by per- 
forming an exact line search in the nonzero direction d: that is, x + is the 
solution to min tgR f(x + td). This gives us 

(b-Axfd 

X +~ X+ d T Ad d 
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and 



( o \ ft \ ft *\ l \\ * II 2 1 n * .,2 ((Ax-bfdf 

(3.2) f(x+) - f{x ) = -\\x+ -x \\ A = -\\x -x \\ A - 



2" T 11/1 2" 2d T Ad ' 

One natural choice of a set of easily-computable search directions is to choose 
d from the set of coordinate directions, {ei, . . . , e n }. Note that, when using 
search direction e^, we can compute the new point 

bj - afx 

iC _|_ iC I ^% 
Q>ii 

using only 2n + 2 arithmetic operations. If the search direction is chosen 
at each iteration by successively cycling through the set of coordinate direc- 
tions, then the algorithm is known to be linearly convergent but with a rate 
not easily expressible in terms of typical matrix quantities (see [8] or [T7]) . 
However, by choosing a coordinate direction as a search direction randomly 
according to an appropriate probability distribution, we can obtain a con- 
vergence rate in terms of the relative condition number. This is expressed in 
the following result. 

Algorithm 3.3 Consider an n-by-n positive semidefinite system Ax = b 
and let Xq G R n be an arbitrary starting point. For j = 0,1,..., compute 

b% ctj Xj 

Q>ii 

where, at each iteration j , the index i is chosen independently at random 
from the set {1, . . . , n}, with distribution 

P {i = * } = 

Notice in the algorithm that the matrix A may be singular, but that 
nonetheless an > almost surely. If A is merely positive semidefinite, solu- 
tions of the system Ax = b coincide with minimizers of the function /, and 
consistency of the system is equivalent to / being bounded below. We now 
have the following result. 
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Theorem 3.4 Consider a consistent positive-semidefinite system Ax = b, 
and define the corresponding objective and error by 



T A 7 T 

-x Ax — b x 



S(x) = f(x)—mmf. 



Then Algorithm \3.3\ is linearly convergent in expectation: indeed, for each 

A(A). 



iteration j = 0, 1, 2, . . 

E[5(x j+ x 



Xj ] < (1 



tiA 



5{xj). 



In particular, if A is positive- definite and x* = A 1 b, we have the equivalent 
property 



E[\\x j+1 - x 



* 112 



\A I -^3 



XA < 1 



1 



|A _1 || 2 tr^4 



I * II 2 

\Xj x \\ A . 



Hence the expected reduction in the squared error \\xj — x*\\\ is at least a 
factor 

1 1 

1 1= , < 1 



hk(A) 



nk(A) 



at each iteration. 

Proof Note that if coordinate direction is chosen during iteration j, then 
Equation |3.2| shows 



(pi - afx j) 2 



f(x j+1 ) = f(Xj) - 



Hence, it follows that 



mx „) | Xj] = s( Xj ) - t^Jhz^jl = s(Xj) _ _1_ 

Using Inequality 2^3 and Equation |3.1 , we easily verify 



\Axj - 



-\\Axj-bf >X(A)S(x 



3/1 



and the first result follows. Applying Equation |3 . 1 1 provides the second result. 
The final result comes from applying Inequalities |2.1| and |2.4 □ 



The simple idea behind the proof of Theorem |3.4 is the main engine 



driving the remaining results in this paper. Fundamentally, the idea is to 
choose a probability distribution so that the expected distance to the solution 
from the new iterate is the distance to the solution from the old iterate minus 
some multiple of a residual. Then, using some type of error bound to bound 
the distance to a solution in terms of the residual, we obtain expected linear 
convergence of the algorithm. 

Now let us consider the more general problem of finding a solution to a 
linear system Ax = b where A is an m x n. More generally, since the system 
might be inconsistent, we seek a "least squares solution" by minimizing the 
function \\Ax — b\\ 2 . The minimizers are exactly the solutions of the positive- 
semidefinite system A T Ax = A T b, to which we could easily apply the previous 
algorithm; however, as usual, we wish to avoid computing the new matrix 
A T A explicitly. Instead, we can proceed as follows. 

Algorithm 3.5 Consider a linear system Ax = b for a nonzero m-by-n 
matrix A. Let xq G R n be an arbitrary initial point and let ro = b — Axq be 
the initial residual. For each j = 0, 1, . . . , compute 

Af rj 



3 ~ \\M\ 2 

where, at each iteration j , the index i is chosen independently at random 
from the set {1, . . . , n} ; with distribution 



P{i = k} = {] r (fc = l,2,...,n) 



F 



(In the formula for ctj, notice by assumption that Ai ^ almost surely.) 

Note that the step size at each iteration can be obtained by directly 
minimizing the residual in the respective coordinate direction. However, the 
algorithm can also be viewed as the application of the algorithm for positive 
definite systems on the system of normal equations, A T Ax = A T b, without 
actually having to compute the matrix A T A. Given the motivation of directly 



minimizing the residual, we would expect that Algorithm 3.5 would converge 



to a least squares solution, even in the case where the underlying system is 
inconsistent. The next result shows that this is, in fact, the case. 
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Theorem 3.6 Consider any linear system Ax = b, where the matrix A is 
nonzero. Define the least-squares residual and the error by 



f{x) = \\\Ax-b\\ 2 
8(x) = f(x) — mmf. 



Then Algorithm 3.5 is linearly convergent in expectation to a least squares 



solution for the system: for each iteration j = 0, 1, 2 . . 

E[5(x j+1 )\ Xj ] < ( 1 " A p^ ) )^)- 
In particular, if A has full column rank, we have the equivalent property 
E[\\x j+1 -x\\ 2 A T A \xj} < (l - -x\\ata 

where x = (A T A)~ l A T b is the unique least-squares solution. 
Proof It is easy to verify, by induction on j, that the iterates 



actly the same as the iterates generated by Algorithm |3.3[ when applied to 
the positive semi-definite system A T Ax = A T b, and furthermore that the 
residuals satisfy rj = b — Axj for all j = 0, 1, 2, . . .. Hence, the results follow 
directly by Theorem |3.4| □ 



By the coordinate descent nature of this algorithm, once we have com- 
puted the initial residual r and column norms {||^4i|| 2 }" = i, we can perform 
each iteration in 0(n) time, just as in the positive-definite case. Specifically, 
this new iteration takes 4n + 1 arithmetic operations, compared with 2n + 2 
for the positive-definite case. 



For a computational example, we apply Algorithm |3.5| to random 500 x n 
matrices where each element of A and b is an independent Gaussian random 
variable and we let n take values 50, 100, 150 and 200. 
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200 


0.99928 
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Note that in the above examples, the theoretical bound provided by Theorem 



3.6 predicts the actual behavior of the algorithm reasonably well. 



4 Randomized Iterated Projections 

Iterated projection algorithms share some important characteristics with co- 
ordinate descent algorithms. Both are well studied and much convergence 
theory exists; a comprehensive overview on iterated projections can be found 
in [5]. However, even for linear systems of equations, standard develop- 
ments do not provide bounds on convergence rates in terms of usual condi- 
tion numbers. By contrast, in the recent paper (22], Strohmer and Vershynin 
obtained such bounds via the following randomized iterated projection al- 
gorithm, which also provided the motivation for our work in the previous 
section. 
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Algorithm 4.1 Consider a linear system Ax = b for a nonzero m-by-n 
matrix A. Let xo G R n be an arbitrary initial point. For each j — 0,1, ... , 
compute 

\(t; 



X j+ i — Xj 



where, at each iteration j , the index i is chosen independently at random 
from the set {1, . . . , m}, with distribution 



\a k \ 12 



P{* = k} = (fc = l,2,...,m) 



Notice that the new iterate Xj+\ is simply the orthogonal projection of 
the old iterate Xj onto the hyperplane {x : ajx — At first sight, the 
choice of probability distribution may seem curious, since we could rescale the 
equations arbitrarily without having any impact on the projection operations. 
However, following [22J, we emphasize that the aim is to understand linear 
convergence rates in terms of linear- algebraic condition measures associated 
with the original system, rather than in terms of geometric notions associated 
with the hyperplanes. This randomized algorithm has the following behavior. 

Theorem 4.2 (Strohmer-Vershynin, [22J) Given any matrix A with full 
column rank, suppose the linear system Ax = b has solution x* . Then Algo- 



rithm 4-1 converges linearly in expectation: for each iteration j = 0, 1, 2, 



E[\\x j+1 -x*\\l \ Xj] < (l - —-^^Wxj - x*\\l 

We seek a way of generalizing the above algorithm and convergence result 
to more general systems of linear inequalities, of the form 



(4.3) 



aj x < h (i e /. 



<j 



aj x = bi (i G J= 



where the disjoint index sets J< and 1= partition the set {1,2, . . . ,m}. To 
do so, staying with the techniques of the previous section, we need a corre- 
sponding error bound for a system of linear inequalities. First, given a vector 
x G R n , define the vector x + by {x + )i = max{xj,0}. Then a starting point 
for this subject is a result by Hoffman in [TO]. 
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Theorem 4.4 (Hoffman) For any right-hand side vector b G R m , let Sb be 



the set of feasible solutions of the linear system (4-3). Then there exists a 



constant L, independent of b, with the following property: 
(4.5) x G R n and S b ^ d(x, S b ) < L\\e(Ax - 

where the function e: R m — > R m is defined by 

e(y)i = 

In the above result, each component of the vector e(Ax — b) indicates the 
error in the corresponding inequality or equation. In particular e(Ax — b) = 
if and only if x G Sb- Thus Hoffman's result provides a linear bound for the 
distance from a trial point x to the feasible region in terms of the size of the 
"a posteriori error" associated with x. 




We call the minimum constant L such that property (4.5) holds the Hoff- 



man constant for the system (4.3). Several authors give geometric or alge- 
braic meaning to this constant, or exact expressions for it, including [9], [14] . 
[15] ; for a more thorough treatment of the subject, see [15] . In the case of 
linear equations (that is, J< = 0), an easy calculation using the singular value 
decomposition shows that the Hoffman constant is just the reciprocal of the 
smallest nonzero singular value of the matrix A, and hence equals 1 1 ^4. x 1 1 2 
when A has full column rank. 

For the problem of finding a solution to a system of linear inequalities, 



we consider a randomized algorithm generalizing Algorithm 4.1 



Algorithm 4.6 Consider the system of inequalities (4-3). Let xq be an ar- 
bitrary initial point. For each j — 0,1, ... , compute 

f (ajxj - bi) + (i G 7<) 
3 \ ajxj - hi (i G /=) 

ft 

INI 

where, at each iteration j , the index i is chosen independently at random 
from the set {1, ... , m}, with distribution 

p {i = k} = fe£ (fc = l,2,...,m). 
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In the above algorithm, notice j3j = e(Axj — We can now generalize 
Theorem 14.21 as follows. 



Theorem 4.7 Suppose the system (4-3) has nonempty feasible region S 



Then Algorithm \4-6] converges linearly in expectation: for each iteration 
.7 = 0,1,2,..., 



E[d(x j+1 ,S) 2 | Xj] < (1 



where L is the Hoffman constant. 



L 2 \\Af F 



d(xj, S)' 



Proof Note that if the index % is chosen during iteration j, then it follows 
that 

\\xj+i - P s (x j+ i)\\l < \\x j+1 - P S (Xj)\\l 



e(Axj — 6), 



\ij - Ps{xj)\\l + 



a» - Ps(xj] 



e(Axj - b) 2 e(Axj - b)i T 



(xj - P S (Xj)). 



Note Ps(xj) G S. Hence if i G I<, then a[Ps(xj) < bi, and e(Axj — b)i > 0, 



so 



e(Axj — b)iaj(xj — Ps(xj)) > e(Axj — b)i(ajxj — bi) = e(Axj — b) 2 . 
On the other hand, if i G /=, then afPs(xj) — bi, so 

e(Axj — b)idj(xj — Ps(xj)) = e(Axj — b)i(ajxj — bi) = e(Axj — b) 2 . 
Putting these two cases together with the previous inequality shows 



d(xj+i, S) 2 < d(xj,Sy 



e(A Xj - b) 2 



Taking the expectation with respect to the specified probability distribution, 
it follows that 

E[ d (x j+1 ,sy i Xj ) < d( Xj ,s) 2 - l|e( ^ &) " 2 
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and the result now follows by the Hoffman bound. 



□ 



Since Hoffman's bound is not independent of the scaling of the matrix A, it 
is not surprising that a normalizing constant like \\A\\ 2 F term appears in the 
result. 

For a computational example, we consider linear inequality systems Ax < 
b where the elements of A are independent standard Gaussian random vari- 
ables and b is chosen so that the resulting system has a non-empty interior. 
We consider matrices A which are 500 x n, letting n take values 50, 100, 150 



and 200. We then apply Algorithm 4.6 to these problems and observe the 
following computational results. 
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Another natural conditioning measure for linear inequality systems is 
the distance to infeasibility, defined by Renegar in [18] , and shown in [20] to 
govern the convergence rate of interior point methods for linear programming. 
It is interesting, therefore, from a theoretical perspective, to obtain a linear 
convergence rate for iterated projection algorithms in terms of this condition 
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measure as well. For simplicity, we concentrate on the inequality case, Ax < 
b. To begin, let us recall the following results. 

The distance to infeasibility [18] for the system Ax < b is the number 

fi = inf {max{||AA||, ||A6||} : (A + AA)x < b + Ab is infeasible}. 

Theorem 4.8 (Renegar, [18J, Thm 1.1) Suppose fi > 0. Then there ex- 
ists a point x 6 S satisfying \\x\\ < \\b\\/fjL. Furthermore, any point x G R n 
satisfies the inequality 

< maxjl,^!!} 
fi 



Using this, we can bound the linear convergence rate for the Algorithm 4.6 



in terms of the distance to infeasibility, as follows. Notice first that || 0C j X 1 1 IS 
nonincreasing in j, by Inequality 2.5 Suppose we start Algorithm 4.6 at the 
initial point Xq = 0. Applying Theorem 4.8[ we see that for all j = 1, 2, . . . , 



so 



|^j|| < + \\Xj — x\\ < \\x\\ + \\xq — x\\ < 



d(xj,S) <max{-, 2] ^}\\(Ax j -b) 



fi ^ 

Using this inequality in place of Hoffman's bound in the proof of Theorem 
gives 



4.7 



E[d(xj +1 , sy 



1 



1 



|A||!(max{i «}) 



d(xj, S) 2 . 



Although this bound may not be the best possible (and, in fact, it may not 
be as good as the bound provided in Theorem 4.7), this result simply em- 
phasizes a relationship between algorithm speed and conditioning measures 
that appears naturally in other contexts. In the next section, we proceed 
with these ideas in a more general framework. 



5 Metric Regularity and Local Convergence 

The previous section concerned global rates of linear convergence. If instead 
we are interested in local rates, we can re-examine a generalization of our 
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problem through an alternative perspective of set- valued mappings. Consider 
a set- valued mapping $ : R n =4 R m and the problem of solving the associated 
constraint system of the form b G for the unknown vector x. For 

example, finding a feasible solution to Ax < b is equivalent to finding an x 
such that 

(5.1) beAx + R™. 

Related to this is the idea of metric regularity of set-valued mappings. 
We say the set- valued mapping $ is metrically regular at x for b G <&(x) if 
there exists 7 > such that 

(5.2) d(x,^~ 1 (b)) < 7^(6, $(x)) for all (x, b) near (x,b), 

where = {x : b G $(x)}. Further, the modulus of regularity is the 



infimum of all constants 7 such that Equation 5.2 holds. Metric regularity is 
strongly connected with a variety of ideas from variational analysis: a good 
background reference is [2T] . 

Metric regularity generalizes the error bounds discussed in previous sec- 
tions at the expense of only guaranteeing a bound in local terms. For ex- 
ample, if $ is a single-valued linear map, then the modulus of regularity 
(at any x for any b) corresponds to the typical conditioning measure 
(with = 00 implying the map is not metrically regular) and if $ is a 

smooth single-valued mapping, then the modulus of regularity is the recipro- 
cal of the minimum singular value of the Jacobian, V$(x). From an alterna- 
tive perspective, metric regularity provides a framework for generalizing the 
Eckart- Young result on the distance to ill-posedness of linear mappings cited 
in Theorem Specifically, if we define the radius of metric regularity at x 
for b for a set-valued mapping $ between finite dimensional spaces by 

rad$(x|6) = iiif { : $ + E not metrically regular at x for b + E(x)}, 

where the infimum is over all linear functions E, then one obtains the strik- 
ingly simple relationship (see [H]) 

modulus of regularity of $ at x for b 



rad$(x|6) ' 



We will not be directly using the above result. Here, we simply use the 
fundamental idea of metric regularity which says that the distance from a 
point to the solution set, d(x, $ _1 (6)), is locally bounded by some constant 
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times a "residual" . For example, in the case where <£> corresponds to the linear 



inequality system (5.1), we have that d(b, $(a;)) = \\(Ax — b) + \\ implies that 
the modulus of regularity is in fact a global bound and equals the Hoffman 
bound. More generally, we wish to emphasize that metric regularity ties 
together several of the ideas from previous sections at the expense of those 
results now only holding locally instead of globally. 

In what follows, assume all distances are Euclidean distances. We wish to 
consider how the modulus of regularity of $ affects the convergence rate of 
iterated projection algorithms. We remark that linear convergence for iter- 
ated projection methods on convex sets has been very widely studied: see [SJ, 
for example. Our aim here is to observe, by analogy with previous sections, 
how randomization makes the linear convergence rate easy to interpret in 
terms of metric regularity. 

Let Si, £2, • • • > S m be closed convex sets in a Euclidean space E such that 
fljSj 7^ 0. Then, in a manner similar to [15], we can endow the product space 
E m with the inner product 

m 

((ui,U 2 , ■ ■ ■ ,U m ), (v^Vz, ■ ■ ■ ,V m )} = ^2{Ui,Vi) 

i=l 

and consider the set-valued mapping $ : E — > E m given by 

(5.3) $(x) = (S 1 -x,S 2 -x,...,S m -x). 

Then it clearly follows that x G rijSj G Under appropriate 

regularity assumptions, we obtain the following local convergence result. 



Theorem 5.4 Suppose the set-valued mapping $ given by Equation 5J3 is 
metrically regular at x for with regularity modulus 7. Let 7 be any constant 
strictly larger than 7 and let xq be any initial point sufficiently close to x. 
Further, suppose that Xj+i = Psi{ x j) with probability — fori = l,...,m. 
Then 



E[d(x j+1 ,S) 2 I xj] < (l- ^)d{x 3) Sf. 



Proof First, note that by Inequality 2.5, the distance \\Xj — x\\ is nonin- 
creasing in j. Hence if x$ is sufficiently close to x, then Xj IS clS well for all 



j > 0. Then, again using Inequality 2.5 (applied to the set Si), we have, for 
all points x G S C Si, 

\\ Xj - x\\ 2 - \\xj - P Si {xj)\\ 2 > \\ p sA x j) - A?- 
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Taking the minimum over x G S, we deduce 

d{x v S) 2 - \\ Xj - Ps&iW > d(PsAx,),S) 2 . 



Hence 



i m 

E[d(x j+1 ,S) 2 | Xj) = -J2 d ( P sAxj),S) 2 

i=l 
-r m 

j m 

= d(xj,S) 2 ^2d(xj,Si) 2 

m i=i 

= dix^S) 2 - -rf(0,$(x,)) 2 
m 



< 



(l ^)d(xj,S) 



using the definition of metric regularity. □ 

Note that metric regularity at x for is a slightly stronger assumption 
than actually necessary for this result. Specifically, the above result holds as 



long as Equation ^2 holds for all x near x with 6 = fixed, as opposed to 
the above definition requiring it to hold for all b near b as well. 

For a moment, let m — 2 and consider the sequence of iterates {%j}j>o 
generated by the randomized iterated projection algorithm. By idempotency 
of the projection operator, there's no benefit to projecting onto the same set 
in two consecutive iterations, so the subsequence consisting of different iter- 
ates corresponds exactly to that of the non-randomized iterated projection 
algorithm. In particular, if Xj G Si, then 

d(Ps 2 (xj), S) 2 < d(xj, S) 2 — d(xj, S2) 2 = d(xj, S) 2 — [d(xj, S2) 2 + d(xj, Si) 2 } 

since d(xj, Si) = 0. This gives us the following corollary, which also follows 
through more standard deterministic arguments. 

Corollary 5.5 If $ is metrically regular at x for with regularity modulus 
7 and 7 is larger than 7, then for x sufficiently close to x, the 2-set iterated 
projection algorithm is linearly convergent and 

d(x 3+ i,S) 2 < (l-^dix^S) 2 . 
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Further, consider the following refined version of the m-set randomized 
algorithm. Suppose xq G S\ and i = 1. Then for j = 1,2, ... , let ij be 
chosen uniformly at random from {1, . . . , m}\{ij-i} and Xj + i = Ps^ixj). 
Then we obtain the following similar result. 

Corollary 5.6 If $ is metrically regular at x for with regularity modulus 
7 and 7 is larger than 7, then for x sufficiently close to x, the refined m-set 
randomized iterated projection algorithm is linearly convergent in expectation 
and 1 

E[d(x j+1 ,S) 2 I x h i^x] < (l - ^ m _y_ 2 )d(xj,S) 2 . 

A simple but effective product space formulation by Pierra in [16] has 
the benefit of reducing the problem of finding a point in the intersection of 
finitely many sets to the problem of finding a point in the intersection of 2 
sets. Using the notation above, we consider the closed set in the product 
space given by 

T = SixS 2 x...xS m 

and the subspace 

L = {Ax : x G E} 

where the linear mapping A : E — ► E m is defined by Ax = (x,x,...,x). 
Again, notice that x G HiSi ^> (x, . . . , x) G T fl L. One interesting aspect of 
this formulation is that projections in the product space E m relate back to 
projections in the original space E by 

) G Pt(Ax) <=> Zi G Psi(x) {i — 1) 2, . . • , m) 

(P L (zi, . . . ,z m ))i = —(z 1 + z 2 + ... + z m ) (i = l,...,m) 

m 

This formulation provides a nice analytical framework: we can use the above 
equivalence of projections to consider the method of averaged projections 
directly, defined as follows. 

Algorithm 5.7 Let Si, ... , S m C E be nonempty closed convex sets. Let x 
be an initial point. For j = 1,2, . . ., let 

j m 

" l i=i 
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Simply put, at each iteration, the algorithm projects the current iterate 
onto each set individually and takes the average of those projections as the 
next iterate. In the product space formulation, this is equivalent to Xj+\ = 
Pi{Px{xj)). Expanding on the work of Pierra in [16], additional convergence 
theory for this algorithm has been examined by Bauschke and Borwein in [2] . 
Under appropriate regularity conditions, the general idea is that convergence 
of the iterated projection algorithm for two sets implies convergence of the 
averaged projection algorithm for m sets. In a similar sense, we prove the 
following result in terms of randomized projections. 



Theorem 5.8 Suppose S = is non-empty. If the randomized projec- 

tion algorithm of Theorem 5.4 is linearly convergent in expectation with rate 
a, then so is Algorithm 5.7 . 

Proof Let Xj be the current iterate, xf+ x be the new iterate in the method of 
averaged projections and xf^ be the new iterate in the method of uniformly 
randomized projections. Then note that: 



x 



i m 

AP _ c> („ \ _ tt\„RP 



,i m x/>>,) /--v.,. 

i=l 



By convexity of the Si's, it follows that 

d(xf£ v S) = d{E[xf^\ Xj lS) < E[d(xf^, S)\ Xj ] < ad(xj,S), 
by Jensen's Inequality. □ 



Hence, the method of averaged projections converges no more slowly than 
the method of uniformly random projections. In particular, under the as- 
sumptions of Theorem 5.4, the method of averaged projections converges 



with rate no larger than 1 
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